Dataset analysis (general + geospatial using H3 + folium)¶

This quick task shows what a quick data analysis with Python can look like. The data is anonymized and concerns taxi-like services.

There are two datasets:

data_orders - contains data on user orders with the following columns:

  • order_datetime - order date and time
  • origin_longitude - longitude
  • origin_latitude - latitude
  • m_order_eta - time before driver arrives
  • order_gk - order id
  • order_status_key - status (4 - canceled by client, 9 - canceled by system (reject))
  • is_driver_assigned_key - whether a driver has been assigned
  • cancellations_time_in_seconds - how many seconds elapsed before cancellation

data_offers - contains pairs of order number and the ID of the driver to whom the order was offered.

At the moment when the client clicks on the "Order" button in the application, the matching system looks for the most relevant drivers and offers them an order. The task is to investigate some matching metrics for orders that did not complete successfully (the client did not receive the car in the end).
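For readability in the analysis below, the two documented status codes can be mapped to labels. A minimal sketch (the label strings are my own; only the codes 4 and 9 come from the dataset description):

```python
# Hypothetical helper: map raw order_status_key codes to readable labels.
# Only codes 4 and 9 are documented for this dataset.
STATUS_LABELS = {4: "cancelled_by_client", 9: "rejected_by_system"}

def status_label(code):
    """Return a readable label for an order_status_key value."""
    return STATUS_LABELS.get(code, "unknown")
```

Something like `df_orders['order_status_key'].map(status_label)` would then give a labeled column for plotting.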

Imports¶

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.core.display import HTML

0. Data preparation

In [4]:
df_offers = pd.read_csv("data_offers.csv")  
df_orders = pd.read_csv("data_orders.csv")  

df_orders.dtypes
Out[4]:
order_datetime                    object
origin_longitude                 float64
origin_latitude                  float64
m_order_eta                      float64
order_gk                           int64
order_status_key                   int64
is_driver_assigned_key             int64
cancellations_time_in_seconds    float64
dtype: object
In [5]:
df_orders['order_datetime_DT'] = pd.to_datetime(df_orders['order_datetime'])
df_orders['order_daytime'] = df_orders['order_datetime_DT'].dt.hour

Distribution of orders by reason of failure: cancellations before and after driver assignment, and rejects

In [6]:
df_orders['order_status_key'].value_counts().plot(kind='bar')
Out[6]:
<AxesSubplot:>

Conclusions

The largest number of orders is canceled by the client (over twice as many as are canceled by the system).
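The "over twice" claim can be checked numerically from the status column. A small sketch of the ratio (in the notebook the input would be `df_orders['order_status_key']`):

```python
from collections import Counter

def cancellation_ratio(status_keys):
    """Client cancellations (code 4) per system reject (code 9)."""
    counts = Counter(status_keys)
    return counts[4] / counts[9]
```

A ratio above 2.0 confirms the bar chart reading.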

Graph of the distribution of failures by hour

In [7]:
fig = px.histogram(df_orders, x="order_daytime", color="order_status_key", marginal="violin")
fig.show()

Most failures occur in the morning between 7 and 9 am, peaking at 8 am, and at night between 9 pm and 3 am.

Conclusions

A likely explanation is that these are busy hours, so statistically they see the most failures. Confirming this hypothesis would require a comparison against total order volume. The same hypothesis should be checked for the failures between 9 pm and 3 am.
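If total order volumes were available, the normalization could look like this sketch (`all_orders_per_hour` is hypothetical; this dataset contains only failed orders, so totals would have to come from a separate source):

```python
def fail_rate_by_hour(fails_per_hour, all_orders_per_hour):
    """Share of failed orders per hour.

    Both arguments are dicts keyed by hour (0-23); all_orders_per_hour
    is hypothetical and not part of either dataset here.
    """
    return {h: fails_per_hour.get(h, 0) / total
            for h, total in all_orders_per_hour.items() if total}
```

A flat fail rate across hours would support the "busy hours" explanation; a spike at 8 am even after normalization would point at a matching problem instead.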

Plot average time to cancellation (cancellations_time_in_seconds) with and without a driver, by the hour. If there are outliers in the data, it is better to remove them

In [8]:
df_cancellation_not_blank = df_orders.dropna(subset=['cancellations_time_in_seconds'])
df_cancellation_not_blank.describe()
Out[8]:
origin_longitude origin_latitude m_order_eta order_gk order_status_key is_driver_assigned_key cancellations_time_in_seconds order_daytime
count 7307.000000 7307.000000 2811.000000 7.307000e+03 7307.0 7307.000000 7307.000000 7307.000000
mean -0.964020 51.450557 441.610459 3.000597e+12 4.0 0.384700 157.892021 12.646093
std 0.021619 0.011558 288.057123 2.436466e+07 0.0 0.486558 213.366963 7.425299
min -1.066952 51.399523 60.000000 3.000550e+12 4.0 0.000000 3.000000 0.000000
25% -0.973939 51.444823 233.000000 3.000583e+12 4.0 0.000000 45.000000 7.000000
50% -0.966654 51.452237 372.000000 3.000594e+12 4.0 0.000000 98.000000 13.000000
75% -0.949864 51.456704 653.500000 3.000623e+12 4.0 1.000000 187.500000 20.000000
max -0.867088 51.496169 1559.000000 3.000633e+12 4.0 1.000000 4303.000000 23.000000

There are clear outliers, so extreme observations will be trimmed. After some experimentation with the data, the lower and upper 5th percentiles will be cut off.

In [9]:
# Getting rid of outliers: keep the 5th-95th percentile range
q_low = df_cancellation_not_blank["cancellations_time_in_seconds"].quantile(0.05)
q_hi  = df_cancellation_not_blank["cancellations_time_in_seconds"].quantile(0.95)

df_filtered = df_cancellation_not_blank[(df_cancellation_not_blank["cancellations_time_in_seconds"] < q_hi) & (df_cancellation_not_blank["cancellations_time_in_seconds"] > q_low)]

# 5th-95th percentile
df_filtered.describe()
Out[9]:
origin_longitude origin_latitude m_order_eta order_gk order_status_key is_driver_assigned_key cancellations_time_in_seconds order_daytime
count 6534.000000 6534.000000 2469.000000 6.534000e+03 6534.0 6534.000000 6534.000000 6534.000000
mean -0.964093 51.450704 444.827055 3.000597e+12 4.0 0.377870 127.886134 12.642485
std 0.021469 0.011529 292.153826 2.432113e+07 0.0 0.484892 104.830568 7.445468
min -1.066952 51.399523 60.000000 3.000550e+12 4.0 0.000000 12.000000 0.000000
25% -0.973802 51.444953 228.000000 3.000583e+12 4.0 0.000000 50.000000 7.000000
50% -0.966762 51.452512 411.000000 3.000594e+12 4.0 0.000000 99.000000 13.000000
75% -0.950111 51.456726 657.000000 3.000623e+12 4.0 1.000000 178.000000 20.000000
max -0.867088 51.495445 1559.000000 3.000633e+12 4.0 1.000000 543.000000 23.000000
In [10]:
df_filtered_driver_1 = df_filtered[df_filtered['is_driver_assigned_key'] == 1]
df_filtered_driver_0 = df_filtered[df_filtered['is_driver_assigned_key'] == 0]

def filteredDriver(df, group_by_list, agg_what, agg_function_list, new_cols_names_list):
    df_g = df.groupby(group_by_list).agg({agg_what: agg_function_list})
    df_g.columns = new_cols_names_list
    df_g = df_g.reset_index()
    return df_g
    
df_filtered_driver_1_g = filteredDriver(df_filtered_driver_1, ['order_daytime'],
                                        'cancellations_time_in_seconds', ['mean'],
                                        ['cancellations_MEAN_d1'])
df_filtered_driver_0_g = filteredDriver(df_filtered_driver_0, ['order_daytime'],
                                        'cancellations_time_in_seconds', ['mean'],
                                        ['cancellations_MEAN_d0'])


df_drivers_0_1_g = pd.merge(df_filtered_driver_1_g, df_filtered_driver_0_g, how="inner", on = 'order_daytime')

df_drivers_0_1_g[['cancellations_MEAN_d1', 'cancellations_MEAN_d0']].plot(color = ['#8B008B', '#00CED1'], legend = 1)
Out[10]:
<AxesSubplot:>

Conclusions

On average, cancellation time is longer when a driver is assigned. Overall cancellation time also increases during low-activity hours (from midnight to early morning). For more detailed conclusions, the statistics above should be broken down by cancellation type as well.
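The two groupbys and the merge above can also be expressed as a single pivot. A sketch, assuming the same column names as in the cells above:

```python
import pandas as pd

def mean_cancellation_pivot(df):
    """Mean time to cancellation per hour, split by driver assignment."""
    return df.pivot_table(index='order_daytime',
                          columns='is_driver_assigned_key',
                          values='cancellations_time_in_seconds',
                          aggfunc='mean')
```

The result has one row per hour and one column per assignment flag, so `.plot()` on it reproduces the chart above without the intermediate merge.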

Distribution of the mean ETA over the hours

In [11]:
df_eta_not_blank = df_orders.dropna(subset=['m_order_eta'])
df_eta_not_blank_mean = df_eta_not_blank.groupby('order_daytime')['m_order_eta'].mean()

df_eta_not_blank_mean.plot(kind='bar', color = '#20B2AA')
Out[11]:
<AxesSubplot:xlabel='order_daytime'>

Conclusions

Average estimated time of arrival gets longer during busy hours (in the morning and in the afternoon).
This can also be explained by traffic: during busy hours it's harder to arrive on time.
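Whether failures and ETA really move together across hours can be checked with a simple correlation. A dependency-free sketch (both inputs are hypothetical dicts keyed by hour; in the notebook they would come from the hourly aggregates above):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, plain Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def eta_fail_correlation(fails_by_hour, eta_by_hour):
    """Correlate hourly fail counts with hourly mean ETA."""
    hours = sorted(set(fails_by_hour) & set(eta_by_hour))
    return pearson([fails_by_hour[h] for h in hours],
                   [eta_by_hour[h] for h in hours])
```

A value close to 1 would support the traffic explanation; a value near 0 would suggest the two patterns are driven by different factors.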

Hourly breakdown of the average number of drivers who were offered an order

In [12]:
df_joined = pd.merge(df_orders, df_offers, how="inner", on = 'order_gk')
df_joined['order_datetime_DT'] = pd.to_datetime(df_joined['order_datetime'])
df_joined['order_daytime'] = df_joined['order_datetime_DT'].dt.hour
df_joined.head(3)
Out[12]:
order_datetime origin_longitude origin_latitude m_order_eta order_gk order_status_key is_driver_assigned_key cancellations_time_in_seconds order_datetime_DT order_daytime offered_driver_id
0 18:08:07 -0.978916 51.456173 60.0 3000583041974 4 1 198.0 2023-02-06 18:08:07 18 300020372
1 20:57:32 -0.950385 51.456843 NaN 3000583116437 4 0 128.0 2023-02-06 20:57:32 20 300022635
2 20:57:32 -0.950385 51.456843 NaN 3000583116437 4 0 128.0 2023-02-06 20:57:32 20 300014328
In [13]:
df_count_drivers = df_joined.groupby(['order_daytime', 'order_gk']).agg({'offered_driver_id': ['count']})
df_count_drivers.columns = ['drv_count']
df_count_drivers = df_count_drivers.reset_index()
df_count_drivers

df_drivers_per_hour_mean = df_count_drivers.groupby('order_daytime').agg({'drv_count': ['mean', 'median']})

df_drivers_per_hour_mean.plot(kind='bar', color = ['#20B2AA', '#FFE4C4'], legend = 0)
Out[13]:
<AxesSubplot:xlabel='order_daytime'>
In [14]:
df_drivers_per_hour_mean.describe()
Out[14]:
drv_count
mean median
count 24.000000 24.000000
mean 3.867662 3.458333
std 0.581858 0.721060
min 2.543478 2.000000
25% 3.587739 3.000000
50% 3.873392 3.500000
75% 4.285998 4.000000
max 4.623952 5.000000

Conclusions

I would not say there is a shortage or an excess of drivers. Looking at the mean, yes, fluctuations occur. Keep in mind that between 3 and 6 am traffic should be lower, so the driver buffer does not need to be kept at the average level. As for an excess of drivers, there is potential for optimization between 6 pm and 3 am, when there is little traffic and ETA is around the average, yet the number of available drivers is high. But this should be confirmed by a general-purpose analysis.

Hexagons

In [15]:
# imports for this one

import h3
from folium import Map, Marker, GeoJson
import folium
import branca.colormap as cm
from geojson.feature import *
import json
# it took me an hour to get the h3 package working... so take your time
In [16]:
h3_df = df_orders
h3_df = h3_df[['order_gk', 'origin_latitude', 'origin_longitude']]
h3_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order_gk          10716 non-null  int64  
 1   origin_latitude   10716 non-null  float64
 2   origin_longitude  10716 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 251.3 KB
In [17]:
# THIS WHOLE PART IS COPIED FROM https://github.com/ak4728/h3-nyc-taxi and changed a bit

def counts_by_hexagon(df, resolution):
    """
    Use h3.geo_to_h3 to index each data point into the spatial index of the specified resolution.
    Use h3.h3_to_geo_boundary to obtain the geometries of these hexagons
    
    Ex counts_by_hexagon(data, 9)
    """
    df = df[["lat","lng"]].copy()  # copy the slice to avoid SettingWithCopyWarning
    
    df["hex_id"] = df.apply(lambda row: h3.geo_to_h3(row["lat"], row["lng"], resolution), axis = 1)
    
    df_aggreg = df.groupby(by = "hex_id").size().reset_index()
    df_aggreg.columns = ["hex_id", "value"]
    
    df_aggreg['geometry'] = df_aggreg.hex_id.apply(lambda x: { 'type' : 'Polygon',
    'coordinates': [h3.h3_to_geo_boundary(h=x,geo_json=True)]
    })

    return df_aggreg

def hexagons_dataframe_to_geojson(df_hex, file_output = None):
    """
    Produce the GeoJSON for a dataframe that has a geometry column in geojson 
    format already, along with the columns hex_id and value
    
    Ex hexagons_dataframe_to_geojson(df_hex)
    """    
    list_features = []
    
    for i,row in df_hex.iterrows():
        feature = Feature(geometry = row["geometry"] , id=row["hex_id"], properties = {"value" : row["value"]})
        list_features.append(feature)
        
    feat_collection = FeatureCollection(list_features)
    
    geojson_result = json.dumps(feat_collection)
    
    #optionally write to file
    if file_output is not None:
        with open(file_output,"w") as f:
            json.dump(feat_collection,f)
    
    return geojson_result


def choropleth_map(df_aggreg, border_color = 'black', fill_opacity = 0.7, initial_map = None, with_legend = False,
                   kind = "linear"):
    
    """
    Creates choropleth maps given the aggregated data.
    """    
    #colormap
    min_value = df_aggreg["value"].min()
    max_value = df_aggreg["value"].max()
    m = round ((min_value + max_value ) / 2 , 0)
    
    #take resolution from the first row
    res = h3.h3_get_resolution(df_aggreg.loc[0,'hex_id'])
    
    if initial_map is None:
        initial_map = Map(location= [51.456843, -0.950385], zoom_start=12, tiles="cartodbpositron", 
                attr= '© <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors © <a href="http://cartodb.com/attributions#basemaps">CartoDB</a>' 
            )
        

    #the colormap 
    #color names accepted https://github.com/python-visualization/branca/blob/master/branca/_cnames.json
    if kind == "linear":
        custom_cm = cm.LinearColormap(['green','yellow','red'], vmin=min_value, vmax=max_value)
    elif kind == "outlier":
        #for outliers, values would be -11,0,1
        custom_cm = cm.LinearColormap(['blue','white','red'], vmin=min_value, vmax=max_value)
    elif kind == "filled_nulls":
        custom_cm = cm.LinearColormap(['sienna','green','yellow','red'], 
                                      index=[0,min_value,m,max_value],vmin=min_value,vmax=max_value)
   

    #create geojson data from dataframe
    geojson_data = hexagons_dataframe_to_geojson(df_hex = df_aggreg)
    
    #plot on map
    name_layer = "Choropleth " + str(res)
    if kind != "linear":
        name_layer = name_layer + kind
        
    GeoJson(
        geojson_data,
        style_function=lambda feature: {
            'fillColor': custom_cm(feature['properties']['value']),
            'color': border_color,
            'weight': 1,
            'fillOpacity': fill_opacity 
        }, 
        name = name_layer
    ).add_to(initial_map)
    #add legend (not recommended if multiple layers)
    if with_legend == True:
        custom_cm.add_to(initial_map)
    
    
    
    return initial_map

def plot_scatter(df, metric_col, x='lng', y='lat', marker='.', alpha=1, figsize=(16,12), colormap='viridis'):
    """
    Scatter plot function for h3 indexed objects
    """
    import matplotlib.pyplot as plt  # needed below; not imported elsewhere in this notebook
    df.plot.scatter(x=x, y=y, c=metric_col, title=metric_col,
                    edgecolors='none', colormap=colormap, marker=marker, alpha=alpha, figsize=figsize)
    plt.xticks([], []); plt.yticks([], [])
In [18]:
h3_df.columns = ['dt','lat','lng']
df_aggreg = counts_by_hexagon(df = h3_df, resolution = 8)
df_aggreg.sort_values(by = "value", ascending = False, inplace = True)
hexmap = choropleth_map(df_aggreg = df_aggreg, with_legend = True)
folium.map.LayerControl('bottomright', collapsed=False).add_to(hexmap)
hexmap
Out[18]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [19]:
cumsum = (df_aggreg["value"]/df_aggreg["value"].sum()).sort_values(ascending = False).cumsum()
# flag hexes whose cumulative share of failures is still below 80%
df_aggreg["is80perc"] = pd.Series(0, index=cumsum.index).where(cumsum >= 0.8, 1)

hex_80perc = df_aggreg["is80perc"].sum()
print('Number of hexes that contain 80% of failures:', hex_80perc)
df_aggreg.head(hex_80perc+1)
Number of hexes that contain 80% of failures: 23
Out[19]:
hex_id value geometry is80perc
97 88195d2b1dfffff 1497 {'type': 'Polygon', 'coordinates': [((-0.96741... 1
96 88195d2b1bfffff 870 {'type': 'Polygon', 'coordinates': [((-0.95072... 1
93 88195d2b15fffff 774 {'type': 'Polygon', 'coordinates': [((-0.97531... 1
91 88195d2b11fffff 707 {'type': 'Polygon', 'coordinates': [((-0.96301... 1
95 88195d2b19fffff 667 {'type': 'Polygon', 'coordinates': [((-0.95512... 1
20 88195d284dfffff 653 {'type': 'Polygon', 'coordinates': [((-0.94632... 1
63 88195d2a27fffff 414 {'type': 'Polygon', 'coordinates': [((-0.93842... 1
89 88195d2b0bfffff 372 {'type': 'Polygon', 'coordinates': [((-0.97182... 1
62 88195d2a25fffff 362 {'type': 'Polygon', 'coordinates': [((-0.94282... 1
92 88195d2b13fffff 346 {'type': 'Polygon', 'coordinates': [((-0.95862... 1
85 88195d2b03fffff 257 {'type': 'Polygon', 'coordinates': [((-0.97972... 1
94 88195d2b17fffff 210 {'type': 'Polygon', 'coordinates': [((-0.97091... 1
108 88195d2b39fffff 184 {'type': 'Polygon', 'coordinates': [((-0.98761... 1
23 88195d2861fffff 182 {'type': 'Polygon', 'coordinates': [((-0.97441... 1
60 88195d2a21fffff 156 {'type': 'Polygon', 'coordinates': [((-0.93053... 1
110 88195d2b3dfffff 153 {'type': 'Polygon', 'coordinates': [((-0.99992... 1
104 88195d2b31fffff 143 {'type': 'Polygon', 'coordinates': [((-0.99551... 1
27 88195d2869fffff 125 {'type': 'Polygon', 'coordinates': [((-0.96651... 1
109 88195d2b3bfffff 115 {'type': 'Polygon', 'coordinates': [((-0.98321... 1
117 88195d2b51fffff 98 {'type': 'Polygon', 'coordinates': [((-0.95162... 1
120 88195d2b57fffff 92 {'type': 'Polygon', 'coordinates': [((-0.95952... 1
61 88195d2a23fffff 91 {'type': 'Polygon', 'coordinates': [((-0.92613... 1
119 88195d2b55fffff 85 {'type': 'Polygon', 'coordinates': [((-0.96392... 1
88 88195d2b09fffff 81 {'type': 'Polygon', 'coordinates': [((-0.97622... 0
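The cumulative-share cutoff above can be written out in plain Python. This sketch returns the number of top hexes needed to reach the threshold, i.e. `hex_80perc + 1` in the cell above:

```python
def hexes_covering(values, share=0.8):
    """Count how many of the largest values are needed to cover `share` of the total."""
    total = sum(values)
    running, n = 0, 0
    for v in sorted(values, reverse=True):
        running += v
        if running / total < share:
            n += 1
    return n + 1  # the (n+1)-th hex pushes the cumulative share past the threshold
```

Fed the `value` column of `df_aggreg`, it would return 24 here: 23 hexes stay below 80% and the 24th crosses it.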
In [20]:
df_aggreg_top1 = df_aggreg.head(1).reset_index()
hexmap = choropleth_map(df_aggreg = df_aggreg_top1, with_legend = True)

hexmap
Out[20]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The map above highlights the hex with the largest number of failures.

In [21]:
HTML("""
<style>
h1 {
    text-align:center;
    color:#008080;
}
h2 {
    text-align:center;
    color:#008080;
}

div.inner_cell {
    background-color: #F0F8FF;
}

</style>
""")
Out[21]: